cross-modal transformer




2e5c2cb8d13e8fba78d95211440ba326-Supplemental.pdf

Neural Information Processing Systems

Finally, Section E illustrates qualitative results. We present the encoder-decoder variant of HAMT used in fine-tuning on the right of Figure 1. Compared to the original cross-modal transformer on the left, the variant removes text-to-vision cross-modal attention. The encoder encodes the texts to obtain textual embeddings. The original target location is viewed as a middle stop point.
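The snippet above contrasts a full cross-modal transformer (attention in both directions) with an encoder-decoder variant that drops the text-to-vision direction. A minimal pure-Python sketch of that idea, with toy vectors standing in for learned embeddings (this is an illustrative assumption, not the HAMT implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of feature vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def cross_modal_block(text, vision):
    """Original block: attention flows in both directions."""
    text_out = attention(text, vision, vision)    # text-to-vision attention
    vision_out = attention(vision, text, text)    # vision-to-text attention
    return text_out, vision_out

def encoder_decoder_block(text, vision):
    """Variant: text-to-vision attention removed; the encoder's textual
    embeddings pass through unchanged and are only attended to by vision."""
    vision_out = attention(vision, text, text)
    return text, vision_out
```

With a single vision token, the text stream in the original block simply copies that token, while in the variant the text embeddings are left untouched, which is the structural difference the snippet describes.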



Supplementary Material

Neural Information Processing Systems

Section A provides additional details for the method. The scene encoder extracts the environment information. Following [19], we sample the frames at 2.5 Hz and predict future ... For the ETH and UCY datasets, we adopt the standard metrics (i.e., ... Due to the limitations discussed in Section 4.1, we introduce curve smoothing (CS) into current ... We conduct experiments on PAV using the traditional ADE/FDE metrics. In particular, our method improves the FDE by 13.6% on PETS.
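The ADE/FDE metrics mentioned above are the standard trajectory-prediction errors: average displacement error over all predicted points and final displacement error at the last point. A minimal sketch (a hypothetical helper, not the paper's code):

```python
import math

def ade_fde(pred, gt):
    """Average and final displacement error between a predicted and a
    ground-truth trajectory, each a list of (x, y) points of equal length."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    ade = sum(dists) / len(dists)   # mean point-wise Euclidean error
    fde = dists[-1]                 # error at the final timestep
    return ade, fde
```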



Towards Interpretable Sleep Stage Classification Using Cross-Modal Transformers

Pradeepkumar, Jathurshan, Anandakumar, Mithunjha, Kugathasan, Vinith, Suntharalingham, Dhinesh, Kappel, Simon L., De Silva, Anjula C., Edussooriya, Chamira U. S.

arXiv.org Artificial Intelligence

Accurate sleep stage classification is important for sleep health assessment. In recent years, several machine-learning based sleep staging algorithms have been developed, and in particular, deep-learning based algorithms have achieved performance on par with human annotation. Despite improved performance, a limitation of most deep-learning based algorithms is their black-box behavior, which has limited their use in clinical settings. Here, we propose a cross-modal transformer, a transformer-based method for sleep stage classification. The proposed cross-modal transformer consists of a novel cross-modal transformer encoder architecture along with a multi-scale one-dimensional convolutional neural network for automatic representation learning. Our method outperforms the state-of-the-art methods and eliminates the black-box behavior of deep-learning models by utilizing the interpretability aspect of the attention modules. Furthermore, our method provides considerable reductions in the number of parameters and training time compared to the state-of-the-art methods. Our code is available at https://github.com/Jathurshan0330/Cross-Modal-Transformer. A demo of our work can be found at https://bit.ly/Cross_modal_transformer_demo.
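The abstract pairs the transformer encoder with a multi-scale one-dimensional CNN for representation learning: the same signal is filtered with kernels of several widths so that features at different temporal scales are captured. A minimal pure-Python sketch of that multi-scale idea, using moving-average kernels as stand-ins for learned filters (an assumption for illustration; the paper's network learns its weights):

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D convolution of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def multi_scale_features(signal, kernel_sizes=(3, 5, 7)):
    """Filter one signal at several kernel widths and return one feature
    sequence per scale; a real model would concatenate or fuse these."""
    feats = []
    for k in kernel_sizes:
        kernel = [1.0 / k] * k  # averaging filter standing in for learned weights
        feats.append(conv1d(signal, kernel))
    return feats
```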


CTAL: Pre-training Cross-modal Transformer for Audio-and-Language Representations

Li, Hang, Kang, Yu, Liu, Tianqiao, Ding, Wenbiao, Liu, Zitao

arXiv.org Artificial Intelligence

Existing audio-language task-specific predictive approaches focus on building complicated late-fusion mechanisms. However, these models face challenges of overfitting with limited labels and poor generalization. In this paper, we present a Cross-modal Transformer for Audio-and-Language, i.e., CTAL, which aims to learn the intra-modality and inter-modality connections between audio and language through two proxy tasks on a large amount of audio-and-language pairs: masked language modeling and masked cross-modal acoustic modeling. After fine-tuning our pre-trained model on multiple downstream audio-and-language tasks, we observe significant improvements across various tasks, such as emotion classification, sentiment analysis, and speaker verification. On this basis, we further propose a specially-designed fusion mechanism for the fine-tuning phase, which allows our pre-trained model to achieve better performance. Lastly, we present detailed ablation studies showing that both our novel cross-modality fusion component and our audio-language pre-training methods contribute significantly to the promising results.
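Both proxy tasks above follow the same masked-prediction pattern: hide a fraction of the input positions and train the model to reconstruct them, for text tokens in masked language modeling and for acoustic frames in masked cross-modal acoustic modeling. A minimal sketch of the masking step common to both (a hypothetical helper with an assumed 15% mask ratio, not the CTAL implementation):

```python
import random

def mask_positions(tokens, mask_token="[MASK]", ratio=0.15, rng=None):
    """Replace a random fraction of positions with a mask symbol.

    Returns the masked sequence and a {position: original value} dict
    that serves as the reconstruction target during pre-training.
    """
    rng = rng or random.Random(0)
    masked, targets = list(tokens), {}
    n = max(1, int(len(tokens) * ratio))  # always mask at least one position
    for i in rng.sample(range(len(tokens)), n):
        targets[i] = masked[i]
        masked[i] = mask_token
    return masked, targets
```

For masked language modeling the inputs are word tokens; for masked cross-modal acoustic modeling the same scheme is applied to acoustic frame vectors, with the model predicting the hidden frames using context from both modalities.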